
    COLAB: A Collaborative Multi-factor Scheduler for Asymmetric Multicore Processors

    Funding: Partially funded by the UK EPSRC grants Discovery: Pattern Discovery and Program Shaping for Many-core Systems (EP/P020631/1) and ABC: Adaptive Brokerage for Cloud (EP/R010528/1), and by the Royal Academy of Engineering under the Research Fellowship scheme.
    Increasingly prevalent asymmetric multicore processors (AMPs) are necessary for delivering performance in the era of limited power budgets and dark silicon. However, software fails to use them efficiently. OS schedulers, in particular, handle asymmetry only under restricted scenarios. We have efficient symmetric schedulers, efficient asymmetric schedulers for single-threaded workloads, and efficient asymmetric schedulers for single-program workloads. What we do not have is a scheduler that can handle all runtime factors affecting AMPs for multi-threaded, multi-programmed workloads. This paper introduces the first general-purpose asymmetry-aware scheduler for such workloads. It estimates the performance of each thread on each type of core and identifies communication patterns and bottleneck threads. The scheduler then makes coordinated core-assignment and thread-selection decisions that still provide each application with its fair share of the processor's time. We evaluate our approach using the gem5 simulator on four distinct big.LITTLE configurations and 26 mixed workloads composed of PARSEC and SPLASH-2 benchmarks. Compared to the state-of-the-art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25%, and of 5% to 15% on average, depending on the hardware setup.
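    As a hedged illustration only, and not the COLAB algorithm itself, the Python sketch below shows the shape of one such coordinated decision: rank threads by a bottleneck flag and an estimated big-core benefit, then hand out big-core slots under a simple per-application fairness budget. All names and numbers are invented.

        # Toy sketch of one scheduling round on an asymmetric multicore.
        # Not the COLAB scheduler; the speedup estimates, bottleneck flags
        # and fairness budget below are hypothetical inputs.
        from dataclasses import dataclass

        @dataclass
        class Thread:
            tid: int
            app: str
            big_speedup: float       # estimated big-core speedup vs a little core
            bottleneck: bool = False

        def assign_round(threads, n_big, big_budget):
            """Pick which threads get big cores this round; big_budget maps
            each application to its remaining fair share of big-core slots."""
            ranked = sorted(threads,
                            key=lambda t: (t.bottleneck, t.big_speedup),
                            reverse=True)
            on_big, on_little = [], []
            for t in ranked:
                if n_big > 0 and big_budget.get(t.app, 0) > 0:
                    on_big.append(t)
                    n_big -= 1
                    big_budget[t.app] -= 1
                else:
                    on_little.append(t)
            return on_big, on_little

        if __name__ == "__main__":
            ts = [Thread(1, "A", 2.1, True), Thread(2, "A", 1.3),
                  Thread(3, "B", 1.8), Thread(4, "B", 1.2)]
            big, little = assign_round(ts, n_big=2, big_budget={"A": 1, "B": 1})
            print("big cores:", [t.tid for t in big])        # threads 1 and 3
            print("little cores:", [t.tid for t in little])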

    Restoration of legacy parallelism: transforming pthreads into farm and pipeline patterns

    Funding: This work was generously supported by the EU Horizon 2020 project TeamPlay (https://www.teamplay-h2020.eu), grant number 779882, and by the UK EPSRC Discovery grant, number EP/P020631/1.
    Parallel patterns are a high-level programming paradigm that enables non-experts in parallelism to develop structured parallel programs that are maintainable, adaptive, and portable whilst achieving good performance on a variety of parallel systems. However, there still exists a large base of legacy-parallel code developed using ad-hoc methods and incorporating low-level parallel/concurrency libraries such as pthreads, without any parallel patterns in its fundamental design. Such code would benefit from being restructured and rewritten into pattern-based code, but rewriting it is laborious and error-prone, because typical concurrency and pthreading code is closely intertwined with the business logic of the program. In this paper, we present a new software restoration methodology that transforms legacy-parallel programs implemented using pthreads into structured equivalents based on farm and pipeline patterns. We demonstrate the restoration technique on a number of representative benchmarks, introducing patterned farm and pipeline parallelism into the resulting code, and record improvements in cyclomatic complexity as well as speedups.
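    To make the target of such a transformation concrete, here is a minimal, hedged sketch of a farm pattern (in Python rather than C with pthreads, and not output of the paper's methodology): the business logic lives in a plain worker function, while the thread management is delegated to a generic pool.

        # Illustrative farm pattern: no hand-written locks, queues or thread
        # lifecycle code mixed into the application logic.
        from concurrent.futures import ThreadPoolExecutor

        def worker(item):
            # Business logic only.
            return item * item

        def farm(items, n_workers=4):
            with ThreadPoolExecutor(max_workers=n_workers) as pool:
                return list(pool.map(worker, items))

        if __name__ == "__main__":
            print(farm(range(10)))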

    COMPROF and COMPLACE: shared-memory communication profiling and automated thread placement via dynamic binary instrumentation

    Funding: This work was generously supported by the UK EPSRC Energise grant, number EP/V006290/1.
    This paper presents COMPROF and COMPLACE, a novel profiling tool and thread-placement technique for shared-memory architectures that require no recompilation or user intervention. We use dynamic binary instrumentation to intercept memory operations and estimate inter-thread communication overhead, deriving (and optionally visualising) a communication graph of data sharing between threads. We then use this graph to map threads to cores in order to optimise traffic through the memory system. Different paths through a system's memory hierarchy have different latency, throughput and energy properties; COMPLACE exploits this heterogeneity to provide automatic performance and energy improvements for multi-threaded programs. We demonstrate COMPLACE on the NAS Parallel Benchmark (NPB) suite, where it achieves improvements of up to 12% in execution time and up to 10% in energy consumption, compared to default Linux scheduling, while requiring no modification or recompilation of the application code.
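    The sketch below is illustrative only, not COMPLACE itself: it shows one simple way a communication graph could drive placement, by co-locating the most heavily communicating thread pairs within a cache-sharing core group and then pinning threads accordingly. The graph, core topology and thread ids are hypothetical, and the pinning call assumes Linux.

        # Greedy, heaviest-edge-first placement from a thread communication
        # graph; hypothetical inputs, Linux-only pinning.
        import os

        def plan_placement(comm_graph, core_groups):
            """comm_graph: {(t1, t2): estimated sharing weight};
            core_groups: lists of core ids that share a cache.
            Returns a thread -> core mapping."""
            placed = {}
            groups = [list(g) for g in core_groups]
            for (a, b), _ in sorted(comm_graph.items(), key=lambda kv: -kv[1]):
                for g in groups:
                    free = [c for c in g if c not in placed.values()]
                    if len(free) >= len({a, b} - placed.keys()):
                        for t in (a, b):
                            if t not in placed:
                                placed[t] = free.pop(0)
                        break
            return placed   # threads that do not fit stay with the default scheduler

        def apply_placement(placed, tid_of):
            # Pin each thread (by OS thread id) to its chosen core (Linux only).
            for t, core in placed.items():
                os.sched_setaffinity(tid_of[t], {core})

        if __name__ == "__main__":
            graph = {(0, 1): 900, (2, 3): 700, (1, 2): 100}
            print(plan_placement(graph, core_groups=[[0, 1], [2, 3]]))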

    Kindergarten Cop: dynamic nursery resizing for GHC

    Generational garbage collectors are among the most popular garbage collectors used in programming language runtime systems. Their performance is known to depend heavily on choosing the appropriate size for the area where new objects are allocated (the nursery). In imperative languages, it is usual to make the nursery as large as possible, within the limits imposed by the heap size. Functional languages, however, have quite different memory behaviour. In this paper, we study the effect that nursery size has on the performance of lazy functional programs, through the interplay between cache locality and the frequency of collections. We demonstrate that, in contrast with imperative programs, having large nurseries is not always the best solution. Based on these results, we propose two novel algorithms for dynamic nursery resizing that aim to strike a compromise between good cache locality and a low frequency of garbage collections. We present an implementation of these algorithms in the state-of-the-art GHC compiler for the functional language Haskell, and evaluate them using an extensive benchmark suite. In the best case, we demonstrate a reduction in total execution time of up to 88.5%, an 8.7× overall speedup, compared to using the production GHC garbage collector. On average, our technique gives an improvement of 9.3% in overall performance across a standard suite of 63 benchmarks for the production GHC compiler.
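    As a hedged sketch of the underlying trade-off, and not GHC's actual resizing algorithm, the toy heuristic below grows the nursery when minor collections become too frequent and shrinks it when it has outgrown the cache while retaining little live data; all thresholds and step sizes are illustrative.

        # Toy nursery-resizing heuristic, invoked after each minor collection.
        def next_nursery_size(size_kb, survival_rate, gc_per_sec,
                              cache_kb=1024, min_kb=256, max_kb=65536):
            if gc_per_sec > 100 and size_kb < max_kb:
                return min(size_kb * 2, max_kb)    # collecting too often: grow
            if size_kb > cache_kb and survival_rate < 0.05:
                return max(size_kb // 2, min_kb)   # mostly dead data, poor locality: shrink
            return size_kb                         # otherwise leave it alone

        if __name__ == "__main__":
            size = 4096
            for survival, freq in [(0.02, 150), (0.02, 40), (0.01, 10)]:
                size = next_nursery_size(size, survival, freq)
                print(size)                        # 8192, 4096, 2048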

    Security and Privacy of Medical Data: Challenges for Next-Generation Patient-Centric Healthcare Systems

    Funding: This work has been supported by the EU H2020 grant Serums: Securing Medical Data in Smart Patient-Centric Healthcare Systems (code 826278).
    We describe the recently started EU H2020 project Serums: Securing Medical Data in Smart Patient-Centric Healthcare Systems, which aims to develop novel techniques for the safe and secure collection, storage, exchange and analysis of medical data, allowing the patients of next-generation smart healthcare centers to get the best possible treatment while respecting the privacy and ownership of their sensitive personal data. Our goal is to significantly enhance trust in these new medical systems. We outline the techniques that will be extended or developed over the course of the project and describe the use cases that will be used to verify the effectiveness of these technologies in practice.

    Restoration of legacy parallelism in C and C++ applications

    Parallel patterns are a high-level programming paradigm that enables non-experts in parallelism to develop structured parallel programs that are maintainable, adaptive, and portable whilst achieving good performance on a variety of parallel systems. However, there still exists a large base of legacy-parallel code developed using ad-hoc methods and incorporating low-level parallel/concurrency libraries such as pthreads, without any parallel patterns in its fundamental design. Such code would benefit from being restructured and rewritten into pattern-based code, but rewriting it is laborious and error-prone, because typical concurrency and pthreading code is closely intertwined with the business logic of the program. In this paper, we present a new software restoration methodology that transforms legacy-parallel programs implemented using, for example, pthreads into structured, patterned equivalents. We demonstrate the restoration technique on a number of representative benchmarks, introducing patterned parallelism into the resulting code, and record improvements in cyclomatic complexity as well as speedups.
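    For illustration, the sketch below (again in Python, and not produced by the restoration methodology) shows a second common target shape, a streaming pipeline in which each stage's business logic is a plain function and the queue plumbing is generic.

        # Illustrative streaming pipeline: stages run as threads connected by
        # queues, with a sentinel marking the end of the stream.
        from queue import Queue
        from threading import Thread

        SENTINEL = object()

        def stage(fn, q_in, q_out):
            while True:
                item = q_in.get()
                if item is SENTINEL:
                    q_out.put(SENTINEL)
                    break
                q_out.put(fn(item))

        def pipeline(items, fns):
            queues = [Queue() for _ in range(len(fns) + 1)]
            threads = [Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
                       for i, fn in enumerate(fns)]
            for t in threads:
                t.start()
            for item in items:
                queues[0].put(item)
            queues[0].put(SENTINEL)
            results = []
            while True:
                item = queues[-1].get()
                if item is SENTINEL:
                    break
                results.append(item)
            for t in threads:
                t.join()
            return results

        if __name__ == "__main__":
            print(pipeline(range(5), [lambda x: x + 1, lambda x: x * 2]))   # [2, 4, 6, 8, 10]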

    Mapping parallel programs to heterogeneous CPU/GPU architectures using a Monte Carlo Tree Search

    The single-core processor, which dominated for over 30 years, is now obsolete, with recent trends moving strongly towards parallel systems and demanding a huge shift in programming techniques and practices. We are rapidly approaching an age in which almost all programming will target parallel systems. Parallel hardware is evolving quickly, with large heterogeneous systems, typically comprising a mixture of CPUs and GPUs, becoming mainstream. With this increasing heterogeneity comes increasing complexity: not only does the programmer have to worry about where and how to express the parallelism, they must also derive an efficient mapping of the application onto the available resources. This generally requires in-depth expert knowledge that most application programmers do not have. In this paper, we describe a new technique that automatically derives optimal mappings of an application onto a heterogeneous architecture using a Monte Carlo Tree Search (MCTS) algorithm. Our technique exploits high-level design patterns, targeting a set of well-specified parallel skeletons. On a convolution example, the mappings found by our MCTS achieve speedups within 5% of those obtained by a hand-tuned version of the same application.
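    The compact sketch below is an illustrative UCT-style search over CPU/GPU mapping choices, not the paper's implementation; the component count, the cost model standing in for real measurements, and all parameters are invented.

        # Toy Monte Carlo Tree Search over a mapping of pipeline components
        # onto CPU or GPU. estimate_speedup is a made-up cost model.
        import math, random

        COMPONENTS = 4
        DEVICES = ("cpu", "gpu")

        def estimate_speedup(mapping):
            # Pretend even-indexed components prefer the GPU, odd ones the CPU.
            return sum(1.0 if d == ("gpu" if i % 2 == 0 else "cpu") else 0.4
                       for i, d in enumerate(mapping))

        class Node:
            def __init__(self, mapping=()):
                self.mapping = mapping
                self.children = {}        # device choice -> child node
                self.visits = 0
                self.value = 0.0

            def terminal(self):
                return len(self.mapping) == COMPONENTS

            def expand(self):
                for d in DEVICES:
                    if d not in self.children:
                        self.children[d] = Node(self.mapping + (d,))
                        return self.children[d]

            def best_child(self, c=1.4):
                return max(self.children.values(),
                           key=lambda n: n.value / n.visits +
                           c * math.sqrt(math.log(self.visits) / n.visits))

        def rollout(mapping):
            while len(mapping) < COMPONENTS:
                mapping = mapping + (random.choice(DEVICES),)
            return estimate_speedup(mapping)

        def mcts(iterations=2000):
            root = Node()
            for _ in range(iterations):
                node, path = root, [root]
                while not node.terminal() and len(node.children) == len(DEVICES):
                    node = node.best_child()
                    path.append(node)
                if not node.terminal():
                    node = node.expand()
                    path.append(node)
                reward = rollout(node.mapping)
                for n in path:            # back-propagate the rollout reward
                    n.visits += 1
                    n.value += reward
            node, mapping = root, ()
            while node.children:          # extract the most-visited mapping
                node = max(node.children.values(), key=lambda n: n.visits)
                mapping = node.mapping
            return mapping

        if __name__ == "__main__":
            print(mcts())   # expected to settle on ('gpu', 'cpu', 'gpu', 'cpu')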

    Efficient dynamic pinning of parallelized applications by reinforcement learning with applications

    Funding: This work has been partially supported by the European Union H2020-ICT-2014-1 project RePhrase (No. 644235).
    This paper describes a dynamic framework for mapping the threads of parallel applications to the computation cores of parallel systems. We propose a feedback-based mechanism in which the performance of each thread is collected and used to drive a reinforcement-learning policy that assigns thread affinities to CPU cores. The proposed framework is flexible enough to address different optimization criteria, such as maximum processing speed and minimum speed variance among threads. We evaluate the framework on the Ant Colony Optimization parallel benchmark from the heuristic optimization application domain and demonstrate an improvement of 12% in execution time, compared to the default operating-system scheduling/mapping of threads, under varying availability of resources (e.g. when multiple applications are running on the same system).
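    As an illustration of the feedback idea only, not the paper's reinforcement-learning framework, the sketch below uses a simple epsilon-greedy rule to learn which core a given thread makes the most progress on; the throughput numbers are synthetic, and a real system would follow each choice with an affinity call such as Linux's sched_setaffinity.

        # Epsilon-greedy core selection driven by measured per-core throughput.
        import random

        def choose_core(q_values, cores, epsilon=0.1):
            """q_values: core -> running estimate of this thread's throughput."""
            if random.random() < epsilon or not q_values:
                return random.choice(cores)          # explore
            return max(q_values, key=q_values.get)   # exploit

        def update(q_values, core, reward, alpha=0.3):
            old = q_values.get(core, 0.0)
            q_values[core] = old + alpha * (reward - old)

        if __name__ == "__main__":
            cores, q = [0, 1, 2, 3], {}
            fake_throughput = {0: 1.0, 1: 1.1, 2: 2.0, 3: 0.9}   # core 2 is best
            for _ in range(200):
                core = choose_core(q, cores)
                update(q, core, fake_throughput[core] + random.gauss(0, 0.05))
            print("preferred core:", max(q, key=q.get))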

    Lapedo: hybrid skeletons for programming heterogeneous multicore machines in Erlang

    We describe Lapedo, a novel library of hybrid parallel skeletons for programming heterogeneous multi-core/many-core CPU/GPU systems in Erlang. Lapedo's hybrid skeletons comprise a mixture of CPU and GPU components, allowing skeletons to be flexibly and dynamically mapped to the available resources. We also describe a model for deriving a near-optimal division of work between CPUs and GPUs, ensuring load balancing between the resources. Finally, we evaluate the effectiveness of Lapedo on three realistic use cases from different domains, demonstrating significant speedups compared to executing the same applications on CPU cores only or on a GPU only.
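    A minimal sketch of the load-balancing idea, not Lapedo's actual model and in Python rather than Erlang: split a data-parallel workload between CPU and GPU in proportion to their measured throughputs, so that both parts are expected to finish at roughly the same time.

        # Proportional CPU/GPU work division from measured processing rates.
        def split_work(n_items, cpu_rate, gpu_rate):
            """cpu_rate, gpu_rate: measured items processed per second."""
            gpu_share = gpu_rate / (cpu_rate + gpu_rate)
            n_gpu = round(n_items * gpu_share)
            return n_items - n_gpu, n_gpu            # (CPU chunk, GPU chunk)

        if __name__ == "__main__":
            print(split_work(10_000, cpu_rate=1_200, gpu_rate=4_800))   # (2000, 8000)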